K. Model Diagnostics

We've already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting, using residuals, are more informative for assessing model assumptions, because all covariate effects have been removed.

residuals: $\hat{\epsilon}_{ij} = Y_{ij} - \hat{\mu}_i = Y_{ij} - \bar{Y}_{i\cdot}$ (approximately normal)

semi-studentized residuals: $\omega_{ij} = \hat{\epsilon}_{ij} / \sqrt{MSE}$ (approximately normal)

(standardized) studentized residuals: $\epsilon^{*}_{ij} = \hat{\epsilon}_{ij} / \sqrt{MSE\,(1 - 1/r_i)}$ (approximately normal)

studentized deleted residuals: $\epsilon^{**}_{ij} = \hat{\epsilon}_{ij} \left[ \frac{N - t - 1}{SSE\,(1 - 1/r_i) - \hat{\epsilon}_{ij}^{\,2}} \right]^{1/2} \sim t_{N-t-1}$ (approximately), based on the deleted residual $Y_{ij} - \hat{Y}_{i(j)}$, where $\hat{Y}_{i(j)}$ is the fitted mean $\bar{Y}_{i\cdot}$ from a model fit after deleting $Y_{ij}$.
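The definitions above can be sketched numerically. The course's own code is in SAS and S-PLUS/R; the following is only a minimal stdlib-Python illustration on made-up data, computing the raw, semi-studentized, and studentized residuals for a one-way ANOVA:

```python
from statistics import mean
from math import sqrt

# Hypothetical toy data: t = 2 groups, r_i = 3 replicates each (made up)
groups = {1: [4.0, 5.0, 6.0], 2: [10.0, 12.0, 14.0]}

N = sum(len(ys) for ys in groups.values())   # total sample size
t = len(groups)                              # number of groups

# Raw residuals: ehat_ij = Y_ij - Ybar_i
resid = {i: [y - mean(ys) for y in ys] for i, ys in groups.items()}

# MSE = SSE / (N - t)
SSE = sum(e ** 2 for es in resid.values() for e in es)
MSE = SSE / (N - t)

# Semi-studentized: ehat_ij / sqrt(MSE)
semi_stud = {i: [e / sqrt(MSE) for e in es] for i, es in resid.items()}

# Studentized: ehat_ij / sqrt(MSE * (1 - 1/r_i))
stud = {i: [e / sqrt(MSE * (1 - 1 / len(groups[i]))) for e in es]
        for i, es in resid.items()}
```

Note that each group's residuals sum to zero by construction, which is why the $\hat{\epsilon}_{ij}$ cannot be independent.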

The residuals map to software as follows. SAS: request the quantity in the OUTPUT line of PROC GLM; S-PLUS or R: fit = lm(y ~ factor(x)).

  residual               SAS (OUTPUT line of PROC GLM)    S-PLUS or R
  $\hat{\epsilon}_{ij}$    R or RESIDUAL                    fit$residuals
  $\omega_{ij}$            calculate in a DATA step         resids/summary(fit)$sigma
  $\epsilon^{*}_{ij}$      STUDENT                          see code file lmwork.s
  $\epsilon^{**}_{ij}$     RSTUDENT                         see code file lmwork.s

Some properties:

$\sum_i \sum_j \hat{\epsilon}_{ij} = 0$, thus the $\hat{\epsilon}_{ij}$ are not independent.

$Var[\hat{\epsilon}_{ij}] = Var[Y_{ij} - \bar{Y}_{i\cdot}] = \sigma^2 (r_i - 1)/r_i \neq \sigma^2$, but $Var[\epsilon^{*}_{ij}] = 1$ when the model is correct.

What do we need to check? We will use the studentized residuals $\epsilon^{*}_{ij}$ for most diagnostics.

Tools - Plots

Plot residuals versus fitted values:
- look for outliers
- look for an even scatter of points above and below the horizontal line at zero (indicating homoscedasticity)
- if the $r_i$ are small, also plot residuals versus fitted values

Stem-and-leaf plot or histogram of residuals:
- look for outliers
- look for approximate symmetry around 0
- look for an approximate bell shape

Normal probability plot of residuals (normal quantile plot of residuals):
- look for the residuals to follow the standard-normal straight line

Spread-location plot of residuals versus fitted values:
- look for an even vertical scatter of points
- superimpose the within-group median{residuals} and look for any trend across groups

Plot residuals versus observation number, or plot them in the order in which the data were collected:
- look for a random scatter of points
- any trend may indicate lack of independence

Plot residuals versus any predictor omitted from the model:
- look for a random scatter of points around the horizontal line at 0, which indicates the predictor is not needed in the model

Tools - Statistics

Outliers:
- $|\epsilon^{*}_{ij}| > 3$ flags a potential outlier
- roughly 68% of the $\epsilon^{*}_{ij}$ should fall within $(-1, 1)$
- roughly 95% of the $\epsilon^{*}_{ij}$ should fall within $(-2, 2)$

Normality:
- the skewness of the residuals should be approximately 0
- the kurtosis of the residuals should be approximately 3 (note that SAS PROC UNIVARIATE reports kurtosis minus 3)

Tools - Tests

Outliers:
- if $\max_{ij} |\epsilon^{**}_{ij}| > t_{\alpha/(2N),\, N-t-1}$, then the corresponding $Y_{ij}$ is declared an outlier. Why use a type I error rate of $\alpha/(2N)$?

Normality:
- reject $H_0$: the residuals are normally distributed, at level $\alpha$, if $corr(\hat{\epsilon}_{ij}, E[\hat{\epsilon}_{ij}]) < q_{\alpha}$ from the table for the correlation test for normality. What is $E[\hat{\epsilon}_{ij}]$?

Homoscedasticity

- Hartley test ($F_{max}$): if the assumptions of independence and normality hold and $r_i \equiv r$ for all $i$, then we can test $H_0: \sigma^2_1 = \sigma^2_2 = \cdots = \sigma^2_t$ versus $H_A$: not all $\sigma^2_i$ are equal, where $\sigma^2_i = Var[Y_{ij}]$. Reject $H_0$ at level $\alpha$ if
$$F_{max} = \frac{\max_i(s^2_i)}{\min_i(s^2_i)} > F_{max}(\alpha;\, t,\, r-1).$$
$F_{max}$ has a distribution derived specifically for this test, and critical values can be found in tables. If the $r_i$ are close but not all equal, use df $= \frac{1}{t}\sum_i (r_i - 1)$, not $r - 1$.
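The $F_{max}$ statistic itself is trivial to compute; only the critical value needs a table. A stdlib-Python sketch on made-up data:

```python
from statistics import variance

# Hypothetical samples from t = 3 groups with equal r_i = 4 (made up)
samples = [
    [4.1, 5.0, 6.2, 5.5],
    [10.0, 12.3, 11.1, 13.0],
    [7.2, 7.9, 8.4, 6.8],
]

s2 = [variance(g) for g in samples]   # sample variances s_i^2
F_max = max(s2) / min(s2)             # Hartley's statistic

# Reject H0 of equal variances if F_max exceeds the tabled critical
# value F_max(alpha; t, r-1); the table lookup is left to the reader.
```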

- Modified Levene (median) test: if the assumption of independence holds and normality approximately holds, we can test $H_0: \sigma^2_1 = \sigma^2_2 = \cdots = \sigma^2_t$ versus $H_A$: not all $\sigma^2_i$ are equal, using the group medians $\tilde{Y}_i = \mathrm{median}_j\{Y_{ij}\}$:
1. Compute $z_{ij} = |Y_{ij} - \tilde{Y}_i|$
2. Fit a one-way ANOVA using the $z_{ij}$
3. Reject $H_0$ at level $\alpha$ if $F = MST/MSE > F(\alpha;\, t-1,\, N-t)$, where MST and MSE come from the ANOVA on the $z_{ij}$
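The three steps above can be sketched directly with the stdlib (hypothetical data; in practice you would just refit the ANOVA software on the $z_{ij}$):

```python
from statistics import mean, median

# Hypothetical data for t = 2 groups (made up); group 2 has an
# inflated spread, which the z_ij should pick up.
groups = [
    [3.0, 4.0, 8.0, 5.0],
    [10.0, 11.0, 12.0, 30.0],
]

# Step 1: absolute deviations from the group medians
z = [[abs(y - median(g)) for y in g] for g in groups]

# Step 2: one-way ANOVA on the z_ij
N = sum(len(zi) for zi in z)
t = len(z)
grand = mean([v for zi in z for v in zi])
SST = sum(len(zi) * (mean(zi) - grand) ** 2 for zi in z)
SSE = sum((v - mean(zi)) ** 2 for zi in z for v in zi)
F = (SST / (t - 1)) / (SSE / (N - t))

# Step 3: reject H0 at level alpha if F > F(alpha; t-1, N-t)
```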

L. Remedial Measures

What is the effect of a failure of the one-way ANOVA model assumptions?

Moderate lack of normality leads to only a slight loss of power. Kurtosis has a greater impact on power than skewness. $\hat{\mu}_i$, $\hat{C}$, and MSE are unbiased with or without normally distributed errors.

If the $r_i$ are not all equal, then a violation of homoscedasticity can affect the power of the F-test. If the $r_i$ are approximately equal, then non-constant variance will have only a mild impact on the F-test.

Violation of the independence assumption is potentially the most serious, especially if the ignored correlation is large ($\rho > 0.5$). An ignored positive correlation will give variance estimates (e.g., $\widehat{Var}[\hat{\mu}_i]$, $\widehat{Var}[\hat{C}]$) that are too small, so null hypotheses may be rejected when they should not be.

Outliers usually do not have a big impact, since the F-test is fairly robust to skewness.

Omitting important covariates can have a large impact on the estimated means and their interpretation, and consequently on the F-test as well.

Violation of normality has a larger impact on confidence intervals than on F-tests.

Remedial measures are methods we use to try to fix the violated assumptions.

Outliers
- Fit the model once with the outliers and once without. Compare the two fitted models ($\hat{\mu}_i$, F-test, contrasts of interest). If they are not substantially different in terms of scientific conclusions, then leave the outliers in.
- Always check that outliers are not just the result of a data entry error, equipment malfunction, or miscalculation. If they are, then the outliers should be corrected or omitted.
- If the two models give substantially different conclusions, then both sets of results should be reported, or an alternative analysis technique should be used.

Omitted covariates
- If an omitted covariate appears to be important from a residual plot, then add it to the model and test it for statistical significance.
- If you know an important covariate was omitted, but it was not collected or you do not have access to it, there will be problems with model interpretation.

Independence
- If you know the source of the correlation, then you can fit a random effects model to adjust for it.
- If you do not, then use a working-independence model with a robust sandwich variance estimator, as in generalized estimating equations.

Normality is satisfied but homoscedasticity is not. Suppose the violation is such that the $\epsilon_{ij}$ are independent with $\epsilon_{ij} \sim N(0, \sigma^2_i)$. Since the $\sigma^2_i$ are unknown, they need to be estimated using
$$s^2_i = \frac{1}{r_i - 1} \sum_{j=1}^{r_i} (Y_{ij} - \bar{Y}_{i\cdot})^2.$$
Having non-constant variance means that the $\hat{\mu}_i = \bar{Y}_{i\cdot}$ no longer have minimum variance among all unbiased linear estimators. We must adjust for the groups with larger variances. How do we do that?

Instead of minimizing the least squares criterion $\sum_i \sum_j (Y_{ij} - \mu_i)^2$, we will minimize the weighted least squares criterion $\sum_i \sum_j w_{ij} (Y_{ij} - \mu_i)^2$, where $w_{ij} = 1/s^2_i$. We still get $\hat{\mu}_i = \bar{Y}_{i\cdot}$, but our sums of squares will now be weighted as well:

least squares: $SST = \sum_i r_i (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$, $SSE = \sum_i \sum_j (Y_{ij} - \bar{Y}_{i\cdot})^2$

weighted least squares: $SST_w = \sum_i \frac{r_i}{s^2_i} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$, $SSE_w = \sum_i \sum_j \frac{1}{s^2_i} (Y_{ij} - \bar{Y}_{i\cdot})^2$
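A stdlib-Python sketch of the weighted sums of squares on made-up data. One implementation choice to flag: the grand mean in $SST_w$ is taken here as the weighted grand mean, which is what weighted least squares produces for the overall fit.

```python
from statistics import mean, variance

# Hypothetical groups with clearly unequal variances (made up)
groups = [[1.0, 2.0, 3.0], [10.0, 14.0, 18.0]]

r = [len(g) for g in groups]
s2 = [variance(g) for g in groups]   # s_i^2 estimates sigma_i^2
w = [1.0 / v for v in s2]            # weights w_ij = 1 / s_i^2

# Group means are unchanged: muhat_i = Ybar_i
ybar = [mean(g) for g in groups]

# Weighted grand mean (each group contributes total weight r_i / s_i^2)
grand = (sum(wi * ri * yi for wi, ri, yi in zip(w, r, ybar))
         / sum(wi * ri for wi, ri in zip(w, r)))

# Weighted sums of squares
SST_w = sum(wi * ri * (yi - grand) ** 2 for wi, ri, yi in zip(w, r, ybar))
SSE_w = sum(wi * (y - yi) ** 2
            for wi, yi, g in zip(w, ybar, groups) for y in g)
```

With $w_{ij} = 1/s^2_i$, the weighted SSE reduces to $\sum_i (r_i - 1)$, since each group contributes $(r_i - 1)s^2_i / s^2_i$.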

Now $F = MST/MSE$ will only have an approximate F distribution; the larger the $r_i$, the better the approximation.

Coding:
SAS - WEIGHT statement in PROC GLM
S-PLUS & R - lm(..., weights = ...)

If you saw weighted least squares in regression, this is the same thing. We just need to write the ANOVA model in the regression parameterization and use $\hat{\beta} = (X^T W X)^{-1} X^T W Y$, where $W$ is the diagonal matrix of weights.

Neither normality nor homoscedasticity is satisfied.

(1) Transform the data, i.e., the $Y_{ij}$ values. Watch out for negative and zero values, which affect how transformations can be done.

(a) If $\sigma^2_i = c\mu_i$, then try $\sqrt{Y_{ij}}$. Plot $s^2_i$ versus $\bar{Y}_{i\cdot}$ and look for an increasing or decreasing linear trend, or compute $s^2_i / \bar{Y}_{i\cdot}$ and look for them to take on a similar value for all $i$.

(b) If $\sigma_i = c\mu_i$, then try $\log(Y_{ij} + k)$ for some small $k$. Plot $s_i$ versus $\bar{Y}_{i\cdot}$, or compute $s_i / \bar{Y}_{i\cdot}$, as above.

(c) If $\sigma_i = c\mu^2_i$, then try $\frac{1}{Y_{ij} + k}$ for some small $k$. Plot $s_i$ versus $\bar{Y}_{i\cdot}^2$, or compute $s_i / \bar{Y}_{i\cdot}^2$, as above.
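The diagnostic ratios in (a)-(c) are easy to tabulate. In this made-up example $s_i / \bar{Y}_{i\cdot}$ is constant across groups while the other two ratios drift, so rule (b) would point to a log transform:

```python
from statistics import mean, variance
from math import sqrt

# Hypothetical groups where the spread grows with the mean (made up)
groups = [[1.0, 2.0, 3.0], [4.0, 8.0, 12.0], [10.0, 20.0, 30.0]]

ybar = [mean(g) for g in groups]
s2 = [variance(g) for g in groups]
s = [sqrt(v) for v in s2]

# (a) s_i^2 / Ybar_i roughly constant -> try sqrt(Y)
ratio_a = [v / m for v, m in zip(s2, ybar)]
# (b) s_i / Ybar_i roughly constant   -> try log(Y + k)
ratio_b = [si / m for si, m in zip(s, ybar)]
# (c) s_i / Ybar_i^2 roughly constant -> try 1 / (Y + k)
ratio_c = [si / m ** 2 for si, m in zip(s, ybar)]
```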

(d) If $Y_{ij}$ is a proportion, then try $\log(Y_{ij}) - \log(1 - Y_{ij})$, i.e., the log odds. If the proportions come from differently sized samples, then also try weighted least squares.

(e) If none of the above work, then try the Box-Cox procedure for finding an appropriate power transformation, or try a non-linear model, e.g., a generalized linear model or non-parametric regression.

What are the disadvantages of fitting models on a transformed outcome?
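The log-odds transform in (d) can be sketched in one line (hypothetical proportions, strictly between 0 and 1 so the logs are defined):

```python
from math import log

# Hypothetical observed proportions (made up)
p = [0.10, 0.45, 0.80]

# Log-odds (logit) transform: log(Y) - log(1 - Y)
logit = [log(pi) - log(1 - pi) for pi in p]
```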